# Note that this chunk can be changed to use sample data to demonstrate code.
# Use this for original data, containing PHI
knitr::opts_knit$set(root.dir = normalizePath('../Final_Project_PHI/01_raw_data'))
# Use this for simulated data to use with github post
# knitr::opts_knit$set(root.dir = normalizePath('./sample_data'))library(tidyverse)
library(ggplot2)
library(lubridate)
# library(digest)
library(pROC)
library(knitr) # for table formatting
library(caret)
library(naivebayes) # implicitly used by caret
library(nnet) # implicitly used by caret for penalized multinomial regression
library(randomForest) # implicitly used by caret
library(xgboost)
library(doParallel)
library(pROC) #used for multiclass roc
library(DMwR) # used for SMOTE resampling
theme_set(theme_bw())| Abbreviation | |
|---|---|
| AAP | American Academy of Pediatrics |
| AOM | Acute Otitis Media |
| ASP | Antimicrobial Stewardship Program |
| CDS | Clinical Decision Support |
| ED | Emergency Department |
| QI | Quality Improvement |
Jonathan Beus BMIN503 Final Project
This project examines adherence to treatment guidelines for antibiotic duration for acute otitis media (common ear infection) for patients seen and discharged from our emergency department. An American Academy of Pediatrics (AAP) guideline was published in 2013 with soft recommendations regarding treatment duration AAP Guidleline which were reflected in a local “Clinical Pathway.” Essentially, while 10 days of therapy has classicaly been used, for patients 2 years and older, 7 days of therapy is likely adequate. Our antimicrobial stewardship program (ASP) has asked that we consider implementing clinical decision support to encourage clinicians to follow these guidelines. In preparation for this, we would like to evaluate past rates of adherence to these guidelines and to identify patient and provider characteristics associated with adherence or non-adherence to these guidelines. This information will serve to define the magnitude of the problem and potentially highlight patient or provider populations for additional intervention.
To explore this question, we are using an internal data warehouse, extracted from the Electronic Health Record. We have written SQL queries to extract the pertinent information, identifying patients seen in our ED for Otitis Media from July 2011 through June 2019.
| Name | Title | Division/Department | Primary Discussion Points |
|---|---|---|---|
| Joe Zorc, MD, MSCE | Professor of Pediatrics, Director of Emergency Information Systems | Department of Emergency Medicine | We discussed which factors were most likely to be associated with adherence to the guidelines, including provider type (trainee, advanced practice provider, attending), calendar year, and use of an Order Set. We also discussed potentially un-intended consequences of our CDS intervention including driving providers away from using the Order Set or encouraging providers to use broader-spectrum antibiotics. |
| Brandon Ku, MD | Associate Director for ED QI | Department of Emergency Medicine | We primarily discussed post-intervention assessment of the impact of a CDS-based QI intervention. Due to the rapid-cycle nature of improvement, we discussed using control charts for improvement monitoring. With this in mind, we set a goal of an absolute 10% increase in adherence to the guidelines for patients >= 2 years. |
| Talene Metjian, PharmD | Manager, Antimicrobial Stewardship Program | Antimicrobial Stewardship Program | The majority of non-adherence occurs in patients >= 2 years as the shorter duration was a change from prior practice. The use of 10 days in patients < 2 years was and is the standard duration. As a result, we should focus our analysis on the cohort of patients 2 years and older. |
Antimicrobial resistance is a significant threat to modern healthcare. Antimicrobial resistance is developing more rapidly than new antimicrobials can be discovered. It is generally felt that antibiotic overuse is a major factor driving worsening resistance patterns. Antimicrobial stewardship is a field developed to help providers limit inappropriate antibiotic use. As discussed above, in 2013 an AAP policy guideline noted that, for patients >= 2 years, 7 days of antibiotic therapy was likely as effective as longer durations. The policy was well received, but probably more notable in that it provided guidance to providers about when not treating with antibiotics could be appropriate. The discussion about duration seems to have been overshadowed by the more surprising endorsement to not treat in specific cases. For many providers, when the decision was made to treat a patient with antibiotics, they continued to use 10 days of therapy.
When a patient diagnosed with AOM is discharged from our ED, there are several ways that an order for antibiotics can be placed. The provider could go to their “order tab” and enter the name of the antibiotic they would like to prescribe and, subsequently, the dosing and duration. This occurs some in the ED, but most ED providers have become accustomed to using a discharge Order Set. There is such an Order Set that currently exists for ear complaints. Historically, it has provided a list of antibiotics for providers to choose from (in addition to the ability to add a discharge diagnosis, instructions, etc). The antibiotic group is shared with Order Sets for other kinds of problems that might present to the ED and has historically been defaulted to 10 days. This has made ordering 10 days of antibiotics, the easiest thing to do. While outside the scope of the current report, we are implementing some features in the Order Set to help providers choose an appropriate antibiotic and also to guide them to the recommended duration based on age.
The goal of the current analysis and report is preparatory to CDS interventions and assessment of their effectiveness. As discussed above, we are interested in patient and provider factors that influence the selection of antibiotic duration.
For this project, we have primarily gathered resources with expertise in Clinical Informatics, Antimicrobial Stewardship (pharmacy), and Quality Improvement. Clinical informatics generally includes clinicians and draws from disciplines including human-computer interaction, decision science, analytics, and computer science. Clinical informatics is a burgeoning field and includes broad definitions, although here, we will generally consider its primary intent to be to help clinicians and patients by supporting quality, timely, efficient, and safe care through the use of information technology. While a software engineer might easily be able to add antibiotics and durations to an Order Set, a clinical informaticist gathers additional information from providers and/or their own clinical experience to think about the workflow by which a process is performed to improve the likelihood that the anticipated goal will be achieved in a way that supports, rather than disrupts, the provider’s clinical activities. The clinical informaticists involved (JZ and JB) have extensively discussed which dependent-variables are reasonably extracted from our data warehouse and might be informative in targeting CDS.
Our ASP pharmacist (TM) has been crucial to help define which clinical actions are desired and to incoporate the appropriate antibiotics as well as durations for appropriate patients. Our ASP pharmacist and QI expert (BK) have been instrumental in confirming the decision to focus our analysis on patients >= 2 years and to exclude patients prescribed azithromycin. Azithromycin is not a preferred antibiotic for the treatment of AOM, but is sometimes used in patients with sever-allergies to the preferred agents. Unlike other antibiotics used to treat AOM, the duration of azithromycin is dependent on the dose used, not associated with age, and typically just 3 to 5 days. As a result, for patients treated with azithromycin, the prescription would almost always be adherent to the guideline to use 7 days of therapy. Its inclusion would thus artificially inflate the estimate of adherence to the prescribing recommendations.
The data is derived from an institutional data warehouse. JB wrote SQL queries to extract the data from the data warehouse. We included data for patients seen in the ED between July 1, 2011 and June 30, 2018. Patients were included if:
The primary analysis shown below depends on a patient having received an antibiotic prescription. The data warehouse provides a unique key for visits, which was used to identify antibiotic prescriptions for patients in the cohort. We did not include antibiotics administered in the ED, only outpatient prescriptions. We primarily identified antibiotics using standardized classifications from the EHR. Identification of a standard antibiotic name was preformed in the database queries. The name can be derived from various fields, none of which are completely reliable (some are missing). The names or descriptions tend to include descriptions of the concentrations and/or the salt of the molecule. A series of regular expressions was used to extract the “base” or generce name such as "amoxicllin).
In general, we use descriptive statistics to examine variables associated with adherence to the recommendations. These variables were chosen based on clinical expertise and suspected to impact adherence to the guidelines. In addition to descriptive statistics, we additionally used logistic regression to examine the contribution of each variable when accounting for the others.
# Path to PHI data
read_path <- normalizePath('./')
# Path to local non-PHI data
# read_path <- normalizePath('../../github_final_project/BMIN503_Final_Project/sample_data')
# Read Data
cohort <- read_csv(paste0(read_path, '/cohort.csv'))
ccc <- read_csv(paste0(read_path, '/ccc.csv'))
race <- read_csv(paste0(read_path, '/race.csv'))
meds <- read_csv(paste0(read_path, '/meds.csv'))
# Used to associated provider type with medication order
prov_hx <- read_csv(paste0(read_path, '/prov_hx.csv'))
allergy <- read_csv(paste0(read_path, '/allergy.csv'))abx_all <- allergy %>%
select(VISIT_KEY, ALRG_DESC) %>%
filter(!is.na(ALRG_DESC)) %>%
group_by(VISIT_KEY) %>%
mutate(
pcn_all = ifelse(str_detect(ALRG_DESC,pattern = regex('(cillin|augmentin|zosyn)',ignore_case = T)),1,0),
azith_all = ifelse(str_detect(ALRG_DESC,pattern = regex('(azithro)',ignore_case = T)),1,0),
cef_all = ifelse(str_detect(ALRG_DESC,pattern = regex('ce(f|ph)',ignore_case = T)),1,0),
) %>%
mutate(
pcn_all = max(pcn_all),
azith_all = max(azith_all),
cef_all = max(cef_all)
) %>%
select(-ALRG_DESC) %>%
arrange(VISIT_KEY) %>%
filter(sum(pcn_all, azith_all, cef_all) > 0) %>%
distinct()The provider type history is stored in a different database, so it is not possible to join in the initial query.
meds.cln <- meds
# join to provider type at appropriate time (provider types can change as providers progress in training/etc)
prov_hx.meds <- meds.cln %>%
select(MED_ORD_KEY, MED_ORD_CREATE_DT, PROV_ID) %>%
left_join(prov_hx, by = 'PROV_ID') %>%
filter(MED_ORD_CREATE_DT >= CONTACT_DATE, MED_ORD_CREATE_DT < EFF_TO_DATE) %>%
select(MED_ORD_KEY, PROV_TYPE)
prov_hx.meds %>%
summarise(n_distinct(MED_ORD_KEY))## # A tibble: 1 x 1
## `n_distinct(MED_ORD_KEY)`
## <int>
## 1 48103
meds.cln <- meds.cln %>%
left_join(prov_hx.meds, by = 'MED_ORD_KEY')
# determined these classifications by manual review of provider names
meds.cln <- meds.cln %>%
mutate(
PROV_TYPE = ifelse(PROV_ID %in% c(18941,19355,8743), 'Resident', PROV_TYPE),
PROV_TYPE = ifelse(PROV_ID %in% c(21864), 'Physician', PROV_TYPE)
) %>%
select(-contains(('PROV_ID')))Create variables from cohort data that will be used for exploration and modeling.
cohort.cln <- cohort.cln %>%
filter(AGE_AT_VISIT < 21) %>% # include only patients < 21 as we are interested in pediatric patients
mutate(
ACUITY = factor(ACUITY),
sex_f = factor(SEX),
arr_yr = year(ARRIVE_ED_DT), #extract year from date for grouping
arr_mo = month(ARRIVE_ED_DT), #extract month from date
fy_yr = ifelse(arr_mo < 7, arr_yr, arr_yr + 1),
arr_yr_mo = floor_date(ARRIVE_ED_DT,'month'), # keep year-month
arr_yr_qtr = floor_date(ARRIVE_ED_DT, unit = 'quarter'), # keep year-quarter
arr_yr_wk = floor_date(ARRIVE_ED_DT, unit = 'week'),
) %>%
mutate(
age_ge_2 = ifelse(AGE_AT_VISIT >= 2, 1, 0), #binary age over 2 - this will be primary population
age_ge_2 = factor(age_ge_2, levels = c('0', '1'), labels = c('Age < 2 yrs', 'Age >= 2 yrs')),
age_f = cut(floor(AGE_AT_VISIT), breaks = c(0,6/12,1,2,5,12,Inf), labels = c('0-6 months','6-12 months','1-2 years','2-5 years','5-12 years','12+ years'), right = F), #splint into reasonable ages ranges based on clinical reasoning
age_yr = floor(AGE_AT_VISIT) # essentially round age to first digit for use as continuous variable (little interest in more precision than this)
) %>%
select(-AGE_AT_VISIT, -ARRIVE_ED_DT) %>% # don't need this anymore
mutate(
race_f = factor(RACE),
ethn_f = factor(ETHNICITY)
) %>% # factor these for use in models
select(-RACE, -ETHNICITY, SEX) %>%
mutate(
MED_COMPLEX_IND = factor(MED_COMPLEX_IND, levels = c(0,1), labels = c('No CCC','CCC Present'))
)meds.cln <- meds.cln %>%
mutate(med_yr = year(MED_ORD_CREATE_DT),
med_mo = month(MED_ORD_CREATE_DT),
med_hr = hour(MED_ORD_CREATE_DT),
med_hr_cut = cut(med_hr,breaks = seq(0, 24, 3),right = F),
med_dow = wday(MED_ORD_CREATE_DT, label = T),
med_dow_fri_sat = ifelse(med_dow %in% c('Fri','Sat'), 1, 0)
) %>%
mutate(abx_name = factor(GEN_NM_CLN_SUB)) %>%
select(-GEN_NM_CLN_SUB)Here we only include outpatient prescriptions (not antibiotics administered in the emergency department). The data is a prefiltered in the SQL queries to contain only medications classified as “Anti-Infective Agents” (which is mutually exclusive of topical antimicrobial products). We also exclude anti-infective agents that are not Antibacterial.
meds.cln.op <- meds.cln %>%
filter(D_ORD_MODE == 'Outpatient') %>%
filter(D_ORD_CLASS != 'Historical Med') %>%
filter(!D_PHARM_CLASS %in% c('Antiviral','Anthelmintic','Antifungals','Antimalarial','Antimycobacterial agents'))
length(unique(meds.cln.op$VISIT_KEY))## [1] 28410
We need to align how dates/times are coded. DISCONT_DT is stored in UTC, while ARR_ED_DT is stored in local timezone, so need to adjust DISCONT_DT in order to assess time to discontinuation of an order.
meds.cln.op <- meds.cln.op %>%
mutate(
med_disc_dt = DISCONT_DT - hours(4),
disc_min = time_length(med_disc_dt - MED_ORD_CREATE_DT, unit = 'minutes')
) %>%
select(-DISCONT_DT) %>%
mutate(
ss_used_med = factor(MED_ORD_SS, levels = c(0, 1), labels = c('No Order Set','Used Order Set'))
) %>%
select(-MED_ORD_SS)If there are multiple antibiotic orders we include only the most recent one. Some medication discontinuations occur at follow-up visits with primary care doctors or ED re-visits. Others reflect a change to the plan of care. We look at the distribution of discontinuation times to select a cutoff which is likely to represent a balance between these two possibilities.
meds.cln.op <- meds.cln.op %>%
arrange(VISIT_KEY,med_disc_dt) %>%
group_by(VISIT_KEY) %>%
# group_by(VISIT_KEY,GEN_NM_CLN_SUB) %>%
mutate(vk_abx_ct = n(),
any_disc = ifelse(any(!is.na(med_disc_dt)),1,0), # any order discontinued
all_disc = ifelse(all(!is.na(med_disc_dt)),1,0), # all orders discontinued
ord_seq = row_number())
# majority of discontinuations happen in first 2-3 hours
meds.cln.op %>%
filter(all_disc == 1) %>%
ggplot(aes(disc_min/60)) +
geom_histogram(binwidth = 1) +
xlim(0,24) +
xlab('Hours to Discontinuation')nrow_pre <- nrow(meds.cln.op)
meds.cln.op <- meds.cln.op %>%
ungroup() %>%
filter(is.na(med_disc_dt) | disc_min >= 120) %>%
filter(vk_abx_ct == ord_seq)
paste0('Rows Dropped: ',nrow_pre - nrow(meds.cln.op))## [1] "Rows Dropped: 1606"
To assess guidelines adhrenece, we need to know the duration for which the antibiotic was prescribed. The information is not recorded in an easily accessible discrete way.
##
## 2 TIMES DAILY 3 TIMES DAILY 4 TIMES DAILY DAILY DEFAULT EVERY 12 HOURS EVERY 24 HOURS EVERY 4 HOURS PRN EVERY 6 HOURS EVERY 8 HOURS ONCE <NA>
## 25923 1144 7 925 29 234 4 1 13 25 22 0
# Use ordered frequency from a discrete field to determine number of doses per day
times_daily <- meds.cln.op %>%
filter(FREQ_NM != 'EVERY 4 HOURS PRN') %>%
mutate(daily_or_once = ifelse(FREQ_NM %in% c("DAILY","ONCE"),1,0),
times_daily = as.numeric(str_extract(FREQ_NM,regex('\\b\\d(?=( times daily))',ignore_case = T))),
every_x_hrs = as.numeric(str_extract(FREQ_NM,regex('(?<=every )\\d{1,2}(?= hours)',ignore_case = T)))) %>%
mutate(times_daily = ifelse(is.na(times_daily) & daily_or_once == 1, 1, times_daily),
times_daily = ifelse(is.na(times_daily) & !is.na(every_x_hrs),(24/every_x_hrs),times_daily)) %>%
group_by(FREQ_NM,daily_or_once,times_daily,every_x_hrs) %>%
summarise(n())
# create lookup table for frequency
times_daily <- times_daily %>%
group_by(FREQ_NM, times_daily) %>%
select(FREQ_NM,times_daily)
# combine with lookup table
meds.cln.op <- meds.cln.op %>%
left_join(times_daily, by='FREQ_NM')The “SIG” is the historically hand-written and now, free-text part of the medication prescription indicating the instructions for the patient. The prescription duration is not available as a discrete field in our database, so we need to do some cleaning work to extract it from the text. When the duration is not explicitly stated in the text, we also have information about the total volume that was prescribed. We can use this data along with the frequency and dose to estimate the prescribed duration.
set.seed(1234) # for sample to display below
# show examples of SIGs that we will be cleaning
meds.cln.op %>%
group_by(abx_name) %>%
mutate(abx_wt = 1/n()) %>%
ungroup() %>%
sample_n(size = 10, weight = abx_wt) %>%
select(abx_name, SIG) %>%
kable()| abx_name | SIG |
|---|---|
| Penicillin V | Take 10 mL (250 mg total) by mouth every 8 hours for 10 days. |
| Cephalexin | Take 4 mL (200 mg total) by mouth every 6 hours for 10 days. |
| Cephalexin | Take 4.9 mL (245 mg total) by mouth 4 times daily for 10 days. |
| Cephalexin | Take 8 mL (400 mg total) by mouth every 6 hours for 10 days. |
| Cefdinir | Take 3.4 mL (170 mg total) by mouth once a day for 10 days. |
| Cephalexin | Take 5.5 mL (275 mg total) by mouth every 6 hours for 7 days. |
| Erythromycin | Take 1 tablet(s) (500 mg total) by mouth 3 times daily for 3 days. |
| Levofloxacin | Take 5.2 mL (130 mg total) by mouth 2 times daily for 10 days. |
| Clindamycin | Take 7.5 mL (112.5 mg total) by mouth 3 times daily for 10 days. |
| Cefuroxime | Take 2.8 mL by mouth 2 times daily for 10 days. |
# regex pattern used in one of the cleaning steps below
patt <- paste0('\\b(for|X)\\s+(the|next|\\s+)*(\\d+|','four|',')\\s+(more\\s+)?(day(s)?|dose(s)?)\\b')
# Parse sig, primarily using regular expressions
tmp <- meds.cln.op %>%
select(MED_ORD_KEY, abx_name, FREQ_NM, SIG, times_daily, MED_ORD_QTY, MED_ORD_QTY_VAL) %>%
mutate(sig_ext_1st = str_extract(SIG,regex(patt,ignore_case = T)), #first pass gets at the majority of cases
sig_num = str_extract(sig_ext_1st,regex('\\b\\d+\\b',ignore_case = T)), # number of doses or days
sig_unit = str_extract(sig_ext_1st,regex('\\b(dose|day)(s)?\\b',ignore_case = T)), # doses versus days
sig_then = str_extract(SIG,regex(paste0('\\b(?<=then)[\\w\\.\\s\\(\\)\\,]+',patt),ignore_case = T)), # case when there are two sets of instructions
sig_then_form = str_extract(sig_then,regex(patt,ignore_case = T)),
sig_then_num = str_extract(sig_then_form,regex('\\b\\d+|four\\b',ignore_case = T)),
sig_then_num = ifelse(sig_then_num == 'four',4,sig_then_num),
sig_then_unit = str_extract(sig_then_form,regex('\\b(dose|day)(s)?\\b',ignore_case = T)),
ed_1st = str_extract(SIG,regex('\\b(1st|first)\\s+(dose)\\s+(given in ED)\\b',ignore_case = T))) %>%
mutate(rx_days = ifelse(sig_unit %in% c('day','days') & sig_num != 1, sig_num,NA),
rx_days = ifelse(sig_unit %in% c('dose','doses') & !is.na(times_daily) & is.na(rx_days),as.numeric(sig_num)/as.numeric(times_daily),rx_days),
rx_days = ifelse(!is.na(sig_then_num) & abx_name != 'Azithromycin',as.numeric(sig_num) + as.numeric(sig_then_num), rx_days),
rx_days = ifelse(!is.na(sig_then_num) & sig_num == 1, as.numeric(sig_num) + as.numeric(sig_then_num), rx_days),
rx_days = ifelse(is.na(rx_days) & !is.na(sig_num),sig_num,rx_days),
rx_days = ifelse(abx_name == 'Azithromycin' & rx_days %in% c(2,4) & !is.na(ed_1st),as.numeric(rx_days) + 1, rx_days), #azithromycin tends to be 3 or 5 days
rx_days = ifelse(is.na(rx_days) & !is.na(str_extract(SIG,regex('\\bday(s)?[\\s\\-\\#]+\\d+\\b',ignore_case = T))),5,rx_days), #capture remaining azithromycin 2-5 days
rx_days = as.numeric(rx_days)
) %>% # sig dose extraction
mutate(
sig_dose = str_extract(SIG, regex('\\b[\\d\\.(TWO)]+\\s+(capsule\\(s\\)|tablet\\(s\\)|mL)+',ignore_case = T)),
sig_dose = str_replace(sig_dose,'TWO','2'), # very specific to this data, in future could generalize if needed
sig_dose_num = str_extract(sig_dose,regex('\\b[\\d\\.]+\\b',ignore_case = T)),
ord_qty_num = str_extract(MED_ORD_QTY_VAL,regex('\\b[\\d\\.]+\\b',ignore_case = T)),
ord_qty_unit = str_extract(MED_ORD_QTY_VAL,regex('(day|mL|bottle|capsule|tablet)',ignore_case = T)),
fill_est_days = as.numeric(ifelse(ord_qty_unit == 'day', ord_qty_num, NA)),
fill_est_days = (ifelse(is.na(fill_est_days) & ord_qty_unit %in% c('mL','capsule','tablet'), as.numeric(ord_qty_num)/(as.numeric(sig_dose_num)*as.numeric(times_daily)), fill_est_days)),
fill_est_days = ifelse(MED_ORD_QTY_VAL == '30', 30/times_daily,fill_est_days), #edge case
fill_est_days = ifelse(fill_est_days <= 1, NA, round(fill_est_days)),
rx_days = ifelse(is.na(rx_days),fill_est_days,rx_days)
)
set.seed(1234) # for sample to display below
# Show examples of duration extracted from sig
tmp %>%
group_by(abx_name) %>%
mutate(abx_wt = 1/n()) %>%
ungroup() %>%
sample_n(size = 10, weight = abx_wt) %>%
select(Antibiotic = abx_name, Rx_Sig = SIG, Duration = rx_days) %>%
kable()| Antibiotic | Rx_Sig | Duration |
|---|---|---|
| Penicillin V | Take 10 mL (250 mg total) by mouth every 8 hours for 10 days. | 10 |
| Cephalexin | Take 4 mL (200 mg total) by mouth every 6 hours for 10 days. | 10 |
| Cephalexin | Take 4.9 mL (245 mg total) by mouth 4 times daily for 10 days. | 10 |
| Cephalexin | Take 8 mL (400 mg total) by mouth every 6 hours for 10 days. | 10 |
| Cefdinir | Take 3.4 mL (170 mg total) by mouth once a day for 10 days. | 10 |
| Cephalexin | Take 5.5 mL (275 mg total) by mouth every 6 hours for 7 days. | 7 |
| Erythromycin | Take 1 tablet(s) (500 mg total) by mouth 3 times daily for 3 days. | 3 |
| Levofloxacin | Take 5.2 mL (130 mg total) by mouth 2 times daily for 10 days. | 10 |
| Clindamycin | Take 7.5 mL (112.5 mg total) by mouth 3 times daily for 10 days. | 10 |
| Cefuroxime | Take 2.8 mL by mouth 2 times daily for 10 days. | 10 |
# when we look at examples where no duration was extracted, there are only a few cases where there is not enough information available to infer it (~ 8 cases)
# we also looked at additional subsets, such as unusual durations, and use of "doses" rather than "days", all of which yielded expected results.
# tmp %>%
# filter(rx_days > 10)
meds.cln.op <- tmp %>%
select(MED_ORD_KEY, rx_days) %>%
left_join(meds.cln.op,by='MED_ORD_KEY') %>%
select(VISIT_KEY, rx_days, abx_name, ss_used_med, PROV_TYPE, med_hr, med_hr_cut, med_dow) # drop variables that are no longer needed# Condense provider type to sensical values.
meds.cln.op <- meds.cln.op %>%
mutate(
prov_rx = ifelse(PROV_TYPE %in% c('Medical Student','Resident'), PROV_TYPE, NA),
prov_rx = ifelse(PROV_TYPE %in% c('FELLOW','FELLOW-CRNA'), 'Fellow', prov_rx),
prov_rx = ifelse(PROV_TYPE %in% c('Nurse Practitioner','Physician Assistant'), 'NP/PA', prov_rx),
prov_rx = ifelse(PROV_TYPE %in% c('Physician', 'Physician with Sedation'), 'Attending', prov_rx),
prov_rx = factor(prov_rx)
) %>%
select(-PROV_TYPE)# all refers to the fact that no limits are being placed on the timing of the visit
meds.coh.all <- cohort.cln %>%
left_join(meds.cln.op, by='VISIT_KEY') %>%
filter(!is.na(rx_days)) %>%
mutate(
dur_compl = 0,
dur_compl = ifelse(age_ge_2 == 'Age >= 2 yrs' & rx_days <= 7, 1, dur_compl),
dur_compl = ifelse(age_ge_2 == 'Age < 2 yrs' & rx_days <= 10, 1, dur_compl),
dur_compl = ifelse(abx_name == 'Azithromycin' & rx_days <= 5, 1, dur_compl),
dur_compl = factor(dur_compl, levels = c('0','1'), labels = c( 'Prolonged Duration' , 'Guideline Followed'))
) %>%
mutate(
abx_name = factor(abx_name) # refactor to drop NAs induced by joining to cohort
) %>%
select(-VISIT_KEY, -PAT_KEY)
# for the majority of our analysis we only want visits that occurred through the end of FY 2019
meds.coh <- meds.coh.all %>%
filter(arr_yr_mo >= '2011-07-01' & arr_yr_mo <= '2019-06-30')After cleaning, there are 27156 unique visits. This indicates that 87 % of ED patients had a matching antibiotic. On review of a sample (20) of cases without a match, either no antibiotic was prescribed or the patient received an antibiotic by injection in the ED.
Our primary focus will be on patients >= 2 years old, since this is the group in which the guidelines and our clinical pathway specify a different duration than has been standard. Additionally, we really only are interested in the most frequently used antibiotics. Finally, azithromcyin is not informative, since the duration is almost always 5 days, regardless of the indication (and is not addressed by the guidelines that are referenced).
We examine the frequency with which each antibiotic has been prescribed.
coh.sub %>%
group_by(abx_name) %>%
summarise(n_visits = n()) %>%
arrange(desc(n_visits)) %>%
rename(Antibiotic = abx_name) %>%
mutate(Antibiotic = fct_reorder(Antibiotic, n_visits, .desc = T)) %>%
ggplot(aes(x = Antibiotic, y = n_visits)) +
geom_col() +
geom_text(aes(label=n_visits), nudge_y = .25) +
scale_y_log10() +
theme(axis.text.x = element_text(angle=-60, hjust = 0.05)) +
ylab('# Visits')Based on review of the data, we will only include cases where the antibiotic was used with some reasonable frequency (> 10 times). This is somewhat arbitrary, although we don’t want to focus our efforts on rare cases. We will also exclude azithromycin.
Adherence to the guidelines is much lower in patients >= 2 years of age.
# summarise data
tab <- meds.coh %>%
group_by(
age_ge_2, dur_compl
) %>%
summarise(n = n()) %>%
mutate(age_sum = sum(n),
pct_compl = 100*n/age_sum)
# turn into a matrix and perform chi-square test
chi <- tab %>%
select(-age_sum, - pct_compl) %>%
pivot_wider(
names_from = dur_compl,
values_from = n
) %>%
column_to_rownames(var = 'age_ge_2') %>%
chisq.test(as.matrix())
# chi-square test
chi##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: .
## X-squared = 20280, df = 1, p-value < 2.2e-16
tab %>%
filter(dur_compl == 'Guideline Followed') %>%
ggplot(aes(x = age_ge_2, y = pct_compl)) +
geom_col(position = 'dodge') +
geom_text(aes(y = pct_compl + 3, label = round(pct_compl, 1))) +
xlab('Age') +
ylab('Percent Adherent to Guidelines') +
theme(legend.title = element_blank()) Examine the distribution of ages - patients tend to be on the young side with an approximately exponential distribution.
There are more males than females in the cohort.
coh.sub %>%
mutate(var = SEX) %>%
group_by(var) %>%
summarise(n = n()) %>%
mutate(sum = sum(n),
prop = n/sum) %>%
arrange(desc(n)) %>%
mutate(var = fct_reorder(var, n, .desc = T)) %>%
ggplot(aes(x = var, y = prop)) +
geom_col() +
geom_text(aes(label=paste0('n = ',n,' (',round(100*prop, 2),'%)')), nudge_y = .02) +
# scale_y_log10() +
theme(axis.text.x = element_text(angle=0)) +
xlab('Sex') +
ylab('Proportion') The majority of patients are “Black or African American.”
coh.sub %>%
mutate(var = race_f) %>%
group_by(var) %>%
summarise(n = n()) %>%
mutate(sum = sum(n),
prop = n/sum) %>%
arrange(desc(n)) %>%
mutate(var = fct_reorder(var, n, .desc = T)) %>%
ggplot(aes(x = var, y = prop)) +
geom_col() +
geom_text(aes(angle = 0, label=paste0('n = ',n,' \n(',round(100*prop, 2),'%)')), nudge_y = .15) +
# scale_y_log10() +
theme(axis.text.x = element_text(angle=-60, hjust = 0.1)) +
xlab('Race') +
ylab('Proportion') +
ylim(0,1)There is some variation in volume from year to year.
coh.sub %>%
ggplot(aes(fy_yr)) +
geom_histogram(binwidth = 0.5) +
xlab('Fiscal Year') +
ylab('Case Count')Only a small fraction of patients had a complex chronic condition (“CCC”). This feature is defined by a variety of diagnoses on a patient’s problem list or their dependence on technologic interventions (i.e. chronic need for ventilator).
coh.sub %>%
mutate(var = MED_COMPLEX_IND) %>%
group_by(var) %>%
summarise(n = n()) %>%
mutate(sum = sum(n),
prop = n/sum) %>%
arrange(desc(n)) %>%
mutate(var = fct_reorder(var, n, .desc = T)) %>%
ggplot(aes(x = var, y = prop)) +
geom_col() +
geom_text(aes(label=paste0('n = ',n,' (',round(100*prop, 2),'%)')), nudge_y = .05) +
# scale_y_log10() +
theme(axis.text.x = element_text(angle=-60, hjust = 0.1)) +
xlab('Complex Chronic Condition') +
ylab('Proportion') Summary of antibiotics prescribed after filtering above. Amoxicillin represents a significant majority of cases.
coh.sub %>%
mutate(var = abx_name) %>%
group_by(var) %>%
summarise(n = n()) %>%
mutate(sum = sum(n),
prop = n/sum) %>%
arrange(desc(n)) %>%
mutate(var = fct_reorder(var, n, .desc = T)) %>%
ggplot(aes(x = var, y = prop)) +
geom_col(aes(fill = var)) +
geom_text(aes(label=paste0('n = ',n,' \n(',round(100*prop, 2),'%)')), nudge_y = .1) +
# scale_y_log10() +
theme(axis.text.x = element_text(angle=-60, hjust = 0.1)) +
xlab('Antibiotic') +
ylab('Proportion') +
ylim(0,1) +
scale_fill_brewer(type = 'qualitative', palette = 'Set1') +
theme(legend.position = 'none') A histogram of prescribed antibiotic duration providers more granular detail than the binary classifer of adherence to recommendations. The majority of prescriptions were written for 7 or 10 days with small numbers written for 5, 9, or 14 days.
coh.sub %>%
ggplot(aes(x=rx_days)) +
geom_histogram(binwidth = 1) +
# scale_y_log10() +
xlab('Prescribed Duration') +
ylab('# Cases')# look at duration by antibiotic
coh.sub %>%
ggplot(aes(x=rx_days, stat(density), color = abx_name)) +
geom_freqpoly(binwidth = 1) +
# scale_y_log10() +
xlab('Prescribed Duration') +
ylab('Density')Antibiotic prescriptions were fairly evenly distributed throughout the week with slightly more prescriptions on weekends.
coh.sub %>%
group_by(med_dow) %>%
summarise(n = n()) %>%
mutate(
sum = sum(n),
prop = n/sum
) %>%
ggplot(aes(x=med_dow, y=prop)) +
geom_col() +
xlab('Day of the Week') +
ylab('Proportion')When we look at prescriptions by time of day, we see a pattern that generally resembles patient flow throughout the day. Volume is lowest in the early morning hours to mid-morning and busiest late in the evening.
coh.sub %>%
group_by(med_hr) %>%
summarise(n = n()) %>%
mutate(
sum = sum(n),
prop = n/sum
) %>%
ggplot(aes(x=med_hr, y=prop)) +
geom_line() +
xlab('Hour of the Day') +
ylab('Proportion of Cases')The majority of prescriptions were written by attending providers (ED or Urgent care) with other significant proportions written by residents and advanced practice providers.
coh.sub %>%
mutate(var = prov_rx) %>%
group_by(var) %>%
summarise(n = n()) %>%
mutate(sum = sum(n),
prop = n/sum) %>%
arrange(desc(n)) %>%
mutate(var = fct_reorder(var, n, .desc = T)) %>%
ggplot(aes(x = var, y = prop)) +
geom_col(aes(fill=var)) +
geom_text(aes(label=paste0('n = ',n,' \n(',round(100*prop, 2),'%)')), nudge_y = .1) +
# scale_y_log10() +
theme(axis.text.x = element_text(angle=-60, hjust = 0.1)) +
xlab('Prescriber Type') +
ylab('Proportion') +
ylim(0,1) +
scale_fill_brewer(type = 'qualitative', palette = 'Paired') +
theme(legend.position = 'none') The majority of antibiotic orders were placed using the Order Set, which supports targeting this method as an intervention to improve adherence. Still, nearly 15% of orders were placed outside of the Order Set.
coh.sub %>%
mutate(var = ss_used_med) %>%
group_by(var) %>%
summarise(n = n()) %>%
mutate(sum = sum(n),
prop = n/sum) %>%
arrange(desc(n)) %>%
mutate(var = fct_reorder(var, n, .desc = T)) %>%
ggplot(aes(x = var, y = prop)) +
geom_col() +
geom_text(aes(label=paste0('n = ',n,' \n(',round(100*prop, 2),'%)')), nudge_y = .1) +
# scale_y_log10() +
theme(axis.text.x = element_text(angle=-60, hjust = 0.1)) +
xlab('Order Set Use') +
ylab('Proportion') +
ylim(0,1)The guideline was more more likely to be followed as patients get older.
should make this GLM
f <- glm(dur_compl ~ age_yr, family=binomial, data=coh.sub)
broom::tidy(f, conf.int =T, exponentiate = T)## # A tibble: 2 x 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 0.0969 0.0448 -52.2 0. 0.0887 0.106
## 2 age_yr 1.08 0.00676 11.7 1.52e-31 1.07 1.10
coh.sub %>%
mutate(var = age_yr) %>%
group_by(var, dur_compl) %>%
summarise(n = n()) %>%
mutate(sum = sum(n),
prop = n/sum) %>%
filter(dur_compl == 'Guideline Followed') %>%
ggplot(aes(x = var, y = prop)) +
# geom_smooth(method = 'glm', method.args = list(family = "binomial")) +
geom_line() +
theme(axis.text.x = element_text(angle=-60, hjust = 0.1)) +
xlab('Patient Age') +
ylab('Guideline Followed - Proportion') +
ylim(0,1)Sex does not appear to affect adherence to the guideline.
tab <- coh.sub %>%
mutate(var = SEX) %>%
group_by(var, dur_compl) %>%
summarise(n = n()) %>%
mutate(sum = sum(n),
prop = n/sum)
## in the future, could make this a function since repeat multiple times - more challenging due to non-standard evaluation with dplyr
# turn into a matrix
mat <- tab %>%
select(-sum, -prop) %>%
pivot_wider(
names_from = dur_compl,
values_from = n
) %>%
column_to_rownames('var') %>%
as.matrix()
# Fisher's Exact Test
fisher.test(mat)##
## Fisher's Exact Test for Count Data
##
## data: mat
## p-value = 0.9375
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.8971853 1.1041480
## sample estimates:
## odds ratio
## 0.9952337
tab %>%
filter(dur_compl == 'Guideline Followed') %>%
# arrange(desc(n)) %>%
# ungroup() %>%
# mutate(var = fct_reorder(var, n, .desc = T)) %>%
ggplot(aes(x = var, y = prop)) +
geom_col() +
geom_text(aes(label=paste0(round(100*prop, 2),'%')), nudge_y = .05) +
# scale_y_log10() +
# theme(axis.text.x = element_text(angle=-60, hjust = 0.1)) +
xlab('Sex') +
ylab('Guideline Followed - Proportion') +
ylim(0,1) # scale_color_discrete('Adherence to Guidelines')
# scale_fill_brewer(type = 'qualitative', palette = 'Paired') +
# theme(legend.position = 'none')Race has a small effect on guideline adherence, although many races are quite under-represented. When we look at the most frequent races from above (black, other, white), adherence to the guideline seemed to be quite similar.
tab <- coh.sub %>%
mutate(var = race_f) %>%
group_by(var, dur_compl) %>%
summarise(n = n()) %>%
mutate(sum = sum(n),
prop = n/sum)
# turn into a matrix
mat <- tab %>%
select(-sum, -prop) %>%
pivot_wider(
names_from = dur_compl,
values_from = n
) %>%
filter(!is.na(var)) %>%
column_to_rownames('var') %>%
as.matrix()
# Fisher's Exact Test
fisher.test(mat, simulate.p.value = T)##
## Fisher's Exact Test for Count Data with simulated p-value (based on 2000 replicates)
##
## data: mat
## p-value = 0.05847
## alternative hypothesis: two.sided
# to assess by factor level - univariable logistic regression
fit <- glm(data = coh.sub %>%
mutate(race_f = fct_relevel(race_f, 'Other'))
, formula = dur_compl ~ race_f, family = binomial)
summary(fit)##
## Call:
## glm(formula = dur_compl ~ race_f, family = binomial, data = coh.sub %>%
## mutate(race_f = fct_relevel(race_f, "Other")))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.9005 -0.5171 -0.5171 -0.5050 2.1460
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.952353 0.071883 -27.160 <2e-16 ***
## race_fAmerican Indian or Alaska Native -0.244871 1.056454 -0.232 0.8167
## race_fAsian 0.321606 0.153505 2.095 0.0362 *
## race_fBlack or African American 0.007688 0.078503 0.098 0.9220
## race_fIndian 0.529245 0.259259 2.041 0.0412 *
## race_fMulti-racial -0.164571 0.190492 -0.864 0.3876
## race_fNative Hawaiian or Other Pacific Islander 1.259206 0.710751 1.772 0.0765 .
## race_fRefused -0.062550 0.756189 -0.083 0.9341
## race_fWhite -0.042747 0.106028 -0.403 0.6868
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 10125 on 13384 degrees of freedom
## Residual deviance: 10112 on 13376 degrees of freedom
## (1 observation deleted due to missingness)
## AIC: 10130
##
## Number of Fisher Scoring iterations: 4
tab %>%
filter(dur_compl == 'Guideline Followed' & !is.na(var)) %>%
arrange(desc(n)) %>%
ungroup() %>%
mutate(var = fct_reorder(var, prop, .desc = T)) %>%
ggplot(aes(x = var, y = prop)) +
geom_col() +
geom_text(aes(label=paste0(round(100*prop, 2),'%')), nudge_y = .05) +
# scale_y_log10() +
theme(axis.text.x = element_text(angle=-60, hjust = 0.1)) +
xlab('Race') +
ylab('Guideline Followed - Proportion') +
ylim(0,1) # scale_color_discrete('Adherence to Guidelines')
# scale_fill_brewer(type = 'qualitative', palette = 'Paired') +
# theme(legend.position = 'none')Adherence to the guideline has improved over time, although still remains relatively low.
# logistic regression for association between adherence and year of visit
f <- glm(dur_compl ~ arr_yr, family = binomial, data = coh.sub)
broom::tidy(f, exponentiate = T, conf.int = T) # the meaning of the estimate for year-mo would be difficult to interpret## # A tibble: 2 x 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 3.59e-261 25.8 -23.3 5.86e-120 3.04e-283 2.13e-239
## 2 arr_yr 1.35e+ 0 0.0128 23.2 2.93e-119 1.31e+ 0 1.38e+ 0
coh.sub %>%
mutate(var = arr_yr_mo) %>%
group_by(var, dur_compl) %>%
summarise(n = n()) %>%
mutate(sum = sum(n),
prop = n/sum) %>%
filter(dur_compl == 'Guideline Followed') %>%
# arrange(desc(n)) %>%
# ungroup() %>%
# mutate(var = fct_reorder(var, n, .desc = T)) %>%
ggplot(aes(x = var, y = prop)) +
geom_line() +
# geom_text(aes(label=paste0('n = ',n,' \n(',round(100*prop, 2),'%)')), nudge_y = .1) +
# scale_y_log10() +
theme(axis.text.x = element_text(angle=-60, hjust = 0.1)) +
xlab('ED Visit Month') +
ylab('Guideline Followed - Proportion') +
ylim(0,1) # scale_color_discrete('Adherence to Guidelines')
# scale_fill_brewer(type = 'qualitative', palette = 'Paired') +
# theme(legend.position = 'none')The presence of a “complex chronic condition” does not appear to influence adherence to the guidelines.
tab <- coh.sub %>%
mutate(var = MED_COMPLEX_IND) %>%
group_by(var, dur_compl) %>%
summarise(n = n()) %>%
mutate(sum = sum(n),
prop = n/sum)
# turn into a matrix
mat <- tab %>%
select(-sum, -prop) %>%
pivot_wider(
names_from = dur_compl,
values_from = n
) %>%
column_to_rownames('var') %>%
as.matrix()
# Fisher's Exact Test
fisher.test(mat)##
## Fisher's Exact Test for Count Data
##
## data: mat
## p-value = 0.403
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.6154442 1.1839156
## sample estimates:
## odds ratio
## 0.8625624
tab %>%
filter(dur_compl == 'Guideline Followed') %>%
# arrange(desc(n)) %>%
# ungroup() %>%
# mutate(var = fct_reorder(var, n, .desc = T)) %>%
ggplot(aes(x = var, y = prop)) +
geom_col() +
geom_text(aes(label=paste0(round(100*prop, 2),'%')), nudge_y = .05) +
# scale_y_log10() +
# theme(axis.text.x = element_text(angle=-60, hjust = 0.1)) +
xlab('') +
ylab('Guideline Followed - Proportion') +
ylim(0,1)There is some variation in adherence to the guidelines by which antibiotic was prescribed.
tab <- coh.sub %>%
mutate(var = abx_name) %>%
group_by(var, dur_compl) %>%
summarise(n = n()) %>%
mutate(sum = sum(n),
prop = n/sum)
# turn into a matrix
mat <- tab %>%
select(-sum, -prop) %>%
pivot_wider(
names_from = dur_compl,
values_from = n
) %>%
column_to_rownames('var') %>%
as.matrix()
# Fisher's Exact Test
fisher.test(mat)##
## Fisher's Exact Test for Count Data
##
## data: mat
## p-value = 0.00249
## alternative hypothesis: two.sided
coh.sub %>%
mutate(var = abx_name) %>%
group_by(var, dur_compl) %>%
summarise(n = n()) %>%
mutate(sum = sum(n),
prop = n/sum) %>%
filter(dur_compl == 'Guideline Followed' & !is.na(var)) %>%
arrange(desc(n)) %>%
ungroup() %>%
mutate(var = fct_reorder(var, prop, .desc = T)) %>%
ggplot(aes(x = var, y = prop)) +
geom_col() +
geom_text(aes(label=paste0(round(100*prop, 2),'%')), nudge_y = .05) +
# scale_y_log10() +
theme(axis.text.x = element_text(angle=-60, hjust = 0.1)) +
xlab('Race') +
ylab('Guideline Followed - Proportion') +
ylim(0,1) # scale_color_discrete('Adherence to Guidelines')
# scale_fill_brewer(type = 'qualitative', palette = 'Paired') +
# theme(legend.position = 'none')There is some slight variation in adherence based on day of the week.
tab <- coh.sub %>%
mutate(var = med_dow) %>%
group_by(var, dur_compl) %>%
summarise(n = n()) %>%
mutate(sum = sum(n),
prop = n/sum)
# turn into a matrix
mat <- tab %>%
select(-sum, -prop) %>%
pivot_wider(
names_from = dur_compl,
values_from = n
) %>%
column_to_rownames('var') %>%
as.matrix()
# Fisher's Exact Test
fisher.test(mat, simulate.p.value = T)##
## Fisher's Exact Test for Count Data with simulated p-value (based on 2000 replicates)
##
## data: mat
## p-value = 0.04598
## alternative hypothesis: two.sided
tab %>%
filter(dur_compl == 'Guideline Followed') %>%
# arrange(desc(n)) %>%
# ungroup() %>%
# mutate(var = fct_reorder(var, n, .desc = T)) %>%
ggplot(aes(x = var, y = prop)) +
geom_col() +
geom_text(aes(label=paste0(round(100*prop, 2),'%')), nudge_y = .05) +
# scale_y_log10() +
theme(axis.text.x = element_text(angle=-60, hjust = 0.1)) +
xlab('Day of Week') +
ylab('Guideline Followed - Proportion') +
ylim(0,1) # scale_color_discrete('Adherence to Guidelines')
# scale_fill_brewer(type = 'qualitative', palette = 'Paired') +
# theme(legend.position = 'none')There is some, although questionably significant variation in adherence to the guidelines by time of day. We initially hypothesized that adherence might be worse during busier times of day as changing the prescribed duration required extra work.
tab <- coh.sub %>%
mutate(var = med_hr) %>%
group_by(var, dur_compl) %>%
summarise(n = n()) %>%
mutate(sum = sum(n),
prop = n/sum)
# turn into a matrix
mat <- tab %>%
select(-sum, -prop) %>%
pivot_wider(
names_from = dur_compl,
values_from = n
) %>%
column_to_rownames('var') %>%
as.matrix()
# Fisher's Exact Test
fisher.test(mat, simulate.p.value = T)##
## Fisher's Exact Test for Count Data with simulated p-value (based on 2000 replicates)
##
## data: mat
## p-value = 0.0004998
## alternative hypothesis: two.sided
tab %>%
filter(dur_compl == 'Guideline Followed') %>%
# arrange(desc(n)) %>%
# ungroup() %>%
# mutate(var = fct_reorder(var, n, .desc = T)) %>%
ggplot(aes(x = var, y = prop)) +
geom_line() +
# geom_text(aes(label=paste0('n = ',n,' \n(',round(100*prop, 2),'%)')), nudge_y = .1) +
# scale_y_log10() +
theme(axis.text.x = element_text(angle=-60, hjust = 0.1)) +
xlab('Arrival Hour') +
ylab('Guideline Followed - Proportion') +
ylim(0,1) # scale_color_discrete('Adherence to Guidelines')
# scale_fill_brewer(type = 'qualitative', palette = 'Paired') +
# theme(legend.position = 'none')There seems to be some significant variation in adherence to guidelines based on provider type, noting that medical students are quite underpresented (only 3 in the cohort).
tab <- coh.sub %>%
mutate(var = prov_rx) %>%
group_by(var, dur_compl) %>%
summarise(n = n()) %>%
mutate(sum = sum(n),
prop = n/sum)
# turn into a matrix
mat <- tab %>%
select(-sum, -prop) %>%
pivot_wider(
names_from = dur_compl,
values_from = n
) %>%
column_to_rownames('var') %>%
as.matrix()
# Fisher's Exact Test
fisher.test(mat, simulate.p.value = T)##
## Fisher's Exact Test for Count Data with simulated p-value (based on 2000 replicates)
##
## data: mat
## p-value = 0.0004998
## alternative hypothesis: two.sided
tab %>%
filter(dur_compl == 'Guideline Followed' & !is.na(var)) %>%
arrange(desc(n)) %>%
ungroup() %>%
mutate(var = fct_reorder(var, prop, .desc = T)) %>%
ggplot(aes(x = var, y = prop)) +
geom_col() +
geom_text(aes(label=paste0(round(100*prop, 2),'%')), nudge_y = .05) +
# scale_y_log10() +
theme(axis.text.x = element_text(angle=-60, hjust = 0.1)) +
xlab('Provider Type') +
ylab('Guideline Followed - Proportion') +
ylim(0,1) # scale_color_discrete('Adherence to Guidelines')
# scale_fill_brewer(type = 'qualitative', palette = 'Paired') +
# theme(legend.position = 'none')There was slightly better adherence to the guidelines when the Order Set was not used. This is perhaps not surprising since the Order Set defaulted to 10 days.
tab <- coh.sub %>%
mutate(var = ss_used_med) %>%
group_by(var, dur_compl) %>%
summarise(n = n()) %>%
mutate(sum = sum(n),
prop = n/sum)
# turn into a matrix
mat <- tab %>%
select(-sum, -prop) %>%
pivot_wider(
names_from = dur_compl,
values_from = n
) %>%
column_to_rownames('var') %>%
as.matrix()
# Fisher's Exact Test
fisher.test(mat)##
## Fisher's Exact Test for Count Data
##
## data: mat
## p-value = 8.387e-07
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.6219834 0.8146188
## sample estimates:
## odds ratio
## 0.7111043
tab %>%
filter(dur_compl == 'Guideline Followed') %>%
# arrange(desc(n)) %>%
# ungroup() %>%
# mutate(var = fct_reorder(var, n, .desc = T)) %>%
ggplot(aes(x = var, y = prop)) +
geom_col() +
geom_text(aes(label=paste0(round(100*prop, 2),'%')), nudge_y = .05) +
# scale_y_log10() +
# theme(axis.text.x = element_text(angle=-60, hjust = 0.1)) +
xlab('') +
ylab('Guideline Followed - Proportion') +
ylim(0,1)With all variables included in the model, several features remained significant. As might be expected from univariable analyses above, these included:
coh.mod_var <-
coh.sub %>%
select(dur_compl, abx_name , race_f, age_yr, sex_f, MED_COMPLEX_IND, med_hr_cut, med_dow, arr_yr, ss_used_med, prov_rx) %>%
drop_na() %>%
mutate(race_f = factor(race_f), # to account for dropped NA
med_dow = factor(med_dow, ordered = F) # so that it displays correctly
)
f.full <- glm(dur_compl ~ ., family = binomial, data = coh.mod_var)
summary(f.full)##
## Call:
## glm(formula = dur_compl ~ ., family = binomial, data = coh.mod_var)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6770 -0.5584 -0.3947 -0.2668 2.9230
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.461e+02 2.683e+01 -24.080 < 2e-16 ***
## abx_nameAmoxicillin Clavulanate -2.729e-01 1.030e-01 -2.649 0.00806 **
## abx_nameCefdinir 2.227e-01 1.250e-01 1.781 0.07486 .
## abx_nameClindamycin 2.634e-01 3.311e-01 0.796 0.42625
## race_fAsian 3.854e-01 1.097e+00 0.351 0.72534
## race_fBlack or African American 1.909e-01 1.088e+00 0.175 0.86074
## race_fIndian 5.338e-01 1.119e+00 0.477 0.63348
## race_fMulti-racial -6.133e-02 1.103e+00 -0.056 0.95564
## race_fNative Hawaiian or Other Pacific Islander 1.529e+00 1.312e+00 1.166 0.24371
## race_fOther 4.649e-02 1.090e+00 0.043 0.96599
## race_fRefused 4.635e-01 1.368e+00 0.339 0.73485
## race_fWhite 8.022e-02 1.090e+00 0.074 0.94135
## age_yr 8.318e-02 7.264e-03 11.451 < 2e-16 ***
## sex_fM 2.962e-02 5.495e-02 0.539 0.58985
## MED_COMPLEX_INDCCC Present -3.510e-01 1.693e-01 -2.073 0.03814 *
## med_hr_cut[3,6) 1.547e-01 1.279e-01 1.210 0.22635
## med_hr_cut[6,9) 2.087e-01 1.271e-01 1.641 0.10076
## med_hr_cut[9,12) 2.802e-01 1.106e-01 2.533 0.01132 *
## med_hr_cut[12,15) 2.136e-01 1.102e-01 1.937 0.05269 .
## med_hr_cut[15,18) 7.875e-02 1.132e-01 0.696 0.48658
## med_hr_cut[18,21) -1.665e-01 1.103e-01 -1.509 0.13127
## med_hr_cut[21,24) -1.119e-01 1.082e-01 -1.034 0.30130
## med_dowMon 1.570e-01 9.971e-02 1.575 0.11530
## med_dowTue 1.221e-01 1.009e-01 1.210 0.22620
## med_dowWed 5.075e-02 1.031e-01 0.492 0.62246
## med_dowThu 1.825e-01 1.015e-01 1.797 0.07231 .
## med_dowFri 2.570e-01 1.009e-01 2.549 0.01082 *
## med_dowSat 2.738e-01 9.619e-02 2.847 0.00442 **
## arr_yr 3.192e-01 1.329e-02 24.016 < 2e-16 ***
## ss_used_medUsed Order Set -1.802e-01 7.349e-02 -2.452 0.01422 *
## prov_rxFellow 1.075e+00 1.059e-01 10.150 < 2e-16 ***
## prov_rxMedical Student 2.103e+00 1.243e+00 1.692 0.09062 .
## prov_rxNP/PA -5.979e-01 8.521e-02 -7.016 2.28e-12 ***
## prov_rxResident 3.162e-01 7.373e-02 4.289 1.80e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 10124.5 on 13384 degrees of freedom
## Residual deviance: 9071.5 on 13351 degrees of freedom
## AIC: 9139.5
##
## Number of Fisher Scoring iterations: 5
full.sum <- broom::tidy(f.full, conf.int = T, exponentiate =T)
# full.sum
lr.full <- full.sum %>%
select(term, OR = estimate , conf.2.5 = conf.low, conf.97.5 = conf.high, p.value) %>%
filter(!term %in% c('(Intercept)')) %>%
mutate_at(vars(c(OR:conf.97.5)), function(x) round(x,2)) %>%
mutate_at(vars(p.value), function(x) round(x,3)) %>%
mutate(p.value = ifelse(p.value < 0.001,0.001, p.value))
lr.full %>%
arrange(p.value, OR) %>%
filter(p.value <= 0.05)## # A tibble: 11 x 5
## term OR conf.2.5 conf.97.5 p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 prov_rxNP/PA 0.55 0.46 0.65 0.001
## 2 age_yr 1.09 1.07 1.1 0.001
## 3 prov_rxResident 1.37 1.19 1.58 0.001
## 4 arr_yr 1.38 1.34 1.41 0.001
## 5 prov_rxFellow 2.93 2.38 3.6 0.001
## 6 med_dowSat 1.31 1.09 1.59 0.004
## 7 abx_nameAmoxicillin Clavulanate 0.76 0.62 0.93 0.008
## 8 med_dowFri 1.29 1.06 1.58 0.011
## 9 med_hr_cut[9,12) 1.32 1.07 1.65 0.011
## 10 ss_used_medUsed Order Set 0.84 0.72 0.97 0.014
## 11 MED_COMPLEX_INDCCC Present 0.7 0.5 0.97 0.038
Although the primary goal of this analysis was to determine factors associated with adherence to the guideline, we can also evaluate the model performance and whether it would effectively predict adherence to the guidelines. As shown in this ROC curve, the model is able ot explain some variance in the outcome, but has relatively middling performance with an AUC of ~ 0.73.
##
## Call:
## roc.formula(formula = coh.mod_var$dur_compl ~ predict(f.full, type = "response"), plot = T, print.auc = T)
##
## Data: predict(f.full, type = "response") in 11702 controls (coh.mod_var$dur_compl Prolonged Duration) < 1683 cases (coh.mod_var$dur_compl Guideline Followed).
## Area under the curve: 0.7329
Although not a primary outcome, we were interested in whether the chosen covariates could be used to predict which antibiotic would be prescribed. We used several machine learning models to explore this - penalized multinomial regression, naive Bayes, and random forest.
# choose most common antibiotics
meds.coh %>%
group_by(abx_name) %>%
summarise(n = n()) %>%
arrange(desc(n))## # A tibble: 16 x 2
## abx_name n
## <fct> <int>
## 1 Amoxicillin 22418
## 2 Amoxicillin Clavulanate 3006
## 3 Cefdinir 1041
## 4 Azithromycin 440
## 5 Clindamycin 144
## 6 Cephalexin 29
## 7 Cefuroxime 26
## 8 Sulfamethoxazole Trimethoprim 12
## 9 Cefixime 10
## 10 Ciprofloxacin 10
## 11 Levofloxacin 8
## 12 Cefprozil 6
## 13 Cefpodoxime 2
## 14 Ceftriaxone 2
## 15 Erythromycin 1
## 16 Penicillin V 1
top.meds <- meds.coh %>%
group_by(abx_name) %>%
summarise(n = n()) %>%
arrange(desc(n)) %>%
filter(n > 150) %>%
pull(abx_name)
# select covariates of interest
abx.coh <- meds.coh %>%
filter(abx_name %in% top.meds) %>%
select(abx_name , race_f, age_yr, sex_f, MED_COMPLEX_IND, arr_yr, ss_used_med, prov_rx, pcn_all, azith_all, cef_all) %>%
drop_na() %>%
mutate(race_f = factor(race_f), # to account for dropped NA
abx_name = factor(abx_name) # account for dropped names
)
# create dummy variables for multi-level factors
dummies <- dummyVars(abx_name ~ ., data=abx.coh)
# head(predict(dummies, abx.coh), 10)
dummies <- as_tibble(predict(dummies, abx.coh))
abx.coh <- cbind(abx_name = abx.coh$abx_name, dummies)
# use caret package to split data into test and training sets (70%/30%)
set.seed(54321)
idxtrain <- createDataPartition(y = abx.coh$abx_name, p = 0.7, list =F)
train <- abx.coh[idxtrain,]
testing <- abx.coh[-idxtrain,]
train.resamp <- UBL::SmoteClassif(abx_name ~ ., dat = train, C.perc = "balance", dist = "HEOM")
# balance results in balancing classes of antibiotics
# HEOM allwos for both numeric and nominal values in data (default is Euclidian)
# scale and center continuous variables
pp <- preProcess(train.resamp %>% select(age_yr, arr_yr), c('scale','center','nzv'))
transformed <- predict(pp, newdata = train.resamp)
# scale and center continuous variables
trans.test <- predict(pp, newdata = testing)## # weights: 108 (78 variable)
## initial value 26109.467997
## iter 10 value 11104.324156
## iter 20 value 10689.909284
## iter 30 value 10426.404992
## iter 40 value 10281.131569
## iter 50 value 10253.154532
## iter 60 value 10193.319962
## iter 70 value 10170.357743
## iter 80 value 10159.745883
## iter 90 value 10152.461948
## final value 10152.461757
## converged
## Confusion Matrix and Statistics
##
## Reference
## Prediction Amoxicillin Amoxicillin Clavulanate Azithromycin Cefdinir
## Amoxicillin 6717 899 92 262
## Amoxicillin Clavulanate 0 0 0 0
## Azithromycin 3 0 25 12
## Cefdinir 4 2 15 38
##
## Overall Statistics
##
## Accuracy : 0.8403
## 95% CI : (0.8321, 0.8482)
## No Information Rate : 0.8333
## P-Value [Acc > NIR] : 0.04808
##
## Kappa : 0.0952
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Amoxicillin Class: Amoxicillin Clavulanate Class: Azithromycin Class: Cefdinir
## Sensitivity 0.9990 0.0000 0.189394 0.121795
## Specificity 0.0684 1.0000 0.998110 0.997293
## Pos Pred Value 0.8428 NaN 0.625000 0.644068
## Neg Pred Value 0.9293 0.8883 0.986673 0.965793
## Prevalence 0.8333 0.1117 0.016359 0.038667
## Detection Rate 0.8324 0.0000 0.003098 0.004709
## Detection Prevalence 0.9877 0.0000 0.004957 0.007312
## Balanced Accuracy 0.5337 0.5000 0.593752 0.559544
# Initiate multicore processing to decrease time required to fit models - particularly RF
(cores <- floor(0.9*(parallel::detectCores())))## [1] 7
Here we define parameters that control model training and resampling. There is significant class imbalance in the outcome variable (antibiotic) as amoxicillin is by far the most commonly prescribed. To deal with this, we will use resampling methods so that the models don’t just predict the over-represented class (amoxicillin).
# https://topepo.github.io/caret/subsampling-for-class-imbalances.html#subsampling-during-resampling
# smote.cust <- list(name = "SMOTE with more upsampling",
# func = function (x, y) {
# library(DMwR)
# # library(UBL)
# dat <- if (is.data.frame(x)) x else as.data.frame(x)
# dat$.y <- y
# dat <- SMOTE(.y ~ ., data = dat, perc.over = 500, perc.under = 100)
# # dat <- SmoteClassif(.y ~ ., data = dat, C.perc = C.perc)
# list(x = dat[, !grepl(".y", colnames(dat), fixed = TRUE)],
# y = dat$.y)
# },
# first = TRUE)
#
#
# tr_control <- trainControl(## 10-fold CV
# method = "repeatedcv",
# number = 10,
# repeats = 3,
# verboseIter = T,
# sampling = smote.cust)
tr_control <- trainControl(## 10-fold CV
method = "cv",
number = 10,
verboseIter = T)# set seed as random samples are used for cross-validation
set.seed(1234)
# fit model
system.time(
pmr.abx <- train(
abx_name ~ .,
data = transformed,
method = "multinom",
trControl = tr_control,
tuneLength = 10,
metric = 'Kappa'
)
)## Aggregating results
## Selecting tuning parameters
## Fitting decay = 0.00133 on full training set
## # weights: 108 (78 variable)
## initial value 26105.309114
## iter 10 value 22792.969322
## iter 20 value 22585.283832
## iter 30 value 22486.518042
## iter 40 value 22359.827653
## iter 50 value 22295.853376
## iter 60 value 22223.852736
## iter 70 value 22221.741238
## iter 80 value 22221.699435
## final value 22221.696860
## converged
## user system elapsed
## 7.39 2.45 160.14
# show confusion matrix for penalized multinomial regression
(pmr.test <- confusionMatrix(predict(pmr.abx, trans.test), testing$abx_name))## Confusion Matrix and Statistics
##
## Reference
## Prediction Amoxicillin Amoxicillin Clavulanate Azithromycin Cefdinir
## Amoxicillin 2824 280 21 83
## Amoxicillin Clavulanate 1539 314 15 82
## Azithromycin 1472 171 82 58
## Cefdinir 889 136 14 89
##
## Overall Statistics
##
## Accuracy : 0.4101
## 95% CI : (0.3993, 0.4209)
## No Information Rate : 0.8333
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0676
##
## Mcnemar's Test P-Value : <2e-16
##
## Statistics by Class:
##
## Class: Amoxicillin Class: Amoxicillin Clavulanate Class: Azithromycin Class: Cefdinir
## Sensitivity 0.4200 0.34850 0.62121 0.28526
## Specificity 0.7145 0.77176 0.78569 0.86606
## Pos Pred Value 0.8803 0.16103 0.04599 0.07890
## Neg Pred Value 0.1977 0.90407 0.99205 0.96787
## Prevalence 0.8333 0.11166 0.01636 0.03867
## Detection Rate 0.3500 0.03891 0.01016 0.01103
## Detection Prevalence 0.3976 0.24167 0.22097 0.13979
## Balanced Accuracy 0.5672 0.56013 0.70345 0.57566
# Grid of model parameters for optimizing fit
tunegrid <- expand.grid(
usekernel = c(TRUE, FALSE),
laplace = 0:3,
adjust = c(0,0.5,1,2,3)
)
# seed for random sampling in CV
set.seed(1234)
# Fit naive bayes model
system.time(
nb.abx <- train(
abx_name ~ .,
data = transformed,
method = "naive_bayes",
trControl = tr_control,
metric = 'Kappa',
tuneGrid = tunegrid
)
)## Aggregating results
## Selecting tuning parameters
## Fitting laplace = 0, usekernel = FALSE, adjust = 0 on full training set
## user system elapsed
## 2.52 1.84 9.22
# show confusion matrix for naive bayes for test data
(nb.test <- confusionMatrix(predict(nb.abx, trans.test), testing$abx_name))## Confusion Matrix and Statistics
##
## Reference
## Prediction Amoxicillin Amoxicillin Clavulanate Azithromycin Cefdinir
## Amoxicillin 5 1 3 0
## Amoxicillin Clavulanate 2368 366 14 115
## Azithromycin 4314 527 112 189
## Cefdinir 37 7 3 8
##
## Overall Statistics
##
## Accuracy : 0.0609
## 95% CI : (0.0557, 0.0663)
## No Information Rate : 0.8333
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0101
##
## Mcnemar's Test P-Value : <2e-16
##
## Statistics by Class:
##
## Class: Amoxicillin Class: Amoxicillin Clavulanate Class: Azithromycin Class: Cefdinir
## Sensitivity 0.0007436 0.40622 0.84848 0.0256410
## Specificity 0.9970260 0.65165 0.36626 0.9939410
## Pos Pred Value 0.5555556 0.12784 0.02178 0.1454545
## Neg Pred Value 0.1663772 0.89723 0.99317 0.9620664
## Prevalence 0.8333127 0.11166 0.01636 0.0386665
## Detection Rate 0.0006197 0.04536 0.01388 0.0009914
## Detection Prevalence 0.0011154 0.35481 0.63725 0.0068162
## Balanced Accuracy 0.4988848 0.52893 0.60737 0.5097910
This could take a while to run!
# set seed for random forest
set.seed(1234)
# tuning model parameters
tunegrid <- expand.grid(.mtry = c(3,6,9))
# fit random forest
system.time(
rf.abx <- train(
abx_name ~ .,
data = transformed,
method = "rf",
trControl = tr_control,
ntree = 500,
metric = 'Kappa',
tuneGrid = tunegrid
)
)## Aggregating results
## Selecting tuning parameters
## Fitting mtry = 9 on full training set
## user system elapsed
## 58.34 2.72 296.86
# demonstrate which variables had the highest "importance" in random forest model
# varImp(rf.abx)
# confusion matrix for random forest model
(rf.test <- confusionMatrix(predict(rf.abx, trans.test), testing$abx_name))## Confusion Matrix and Statistics
##
## Reference
## Prediction Amoxicillin Amoxicillin Clavulanate Azithromycin Cefdinir
## Amoxicillin 4271 500 66 151
## Amoxicillin Clavulanate 1771 319 17 84
## Azithromycin 332 34 33 15
## Cefdinir 350 48 16 62
##
## Overall Statistics
##
## Accuracy : 0.5806
## 95% CI : (0.5698, 0.5914)
## No Information Rate : 0.8333
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.071
##
## Mcnemar's Test P-Value : <2e-16
##
## Statistics by Class:
##
## Class: Amoxicillin Class: Amoxicillin Clavulanate Class: Azithromycin Class: Cefdinir
## Sensitivity 0.6352 0.35405 0.25000 0.198718
## Specificity 0.4669 0.73884 0.95200 0.946629
## Pos Pred Value 0.8563 0.14560 0.07971 0.130252
## Neg Pred Value 0.2038 0.90099 0.98707 0.967075
## Prevalence 0.8333 0.11166 0.01636 0.038667
## Detection Rate 0.5293 0.03953 0.00409 0.007684
## Detection Prevalence 0.6182 0.27153 0.05131 0.058991
## Balanced Accuracy 0.5511 0.54645 0.60100 0.572673
tunegrid <- expand.grid(
eta = 0.1,
colsample_bytree=c(0.3,0.8,1),
max_depth=c(3,6),
nrounds=100,
gamma=1,
min_child_weight=1,
subsample = c(0.8, 1)
)
# one approach to tuning - https://towardsdatascience.com/fine-tuning-xgboost-in-python-like-a-boss-b4543ed8b1e
set.seed(1234)
system.time(
xg.abx <- train(
abx_name ~ .,
data = transformed,
method = "xgbTree",
trControl = tr_control,
metric = 'Kappa',
tuneGrid = tunegrid
)
)## Aggregating results
## Selecting tuning parameters
## Fitting nrounds = 100, max_depth = 6, eta = 0.1, gamma = 1, colsample_bytree = 0.8, min_child_weight = 1, subsample = 0.8 on full training set
## user system elapsed
## 57.99 16.78 155.89
## Confusion Matrix and Statistics
##
## Reference
## Prediction Amoxicillin Amoxicillin Clavulanate Azithromycin Cefdinir
## Amoxicillin 4764 505 61 162
## Amoxicillin Clavulanate 1726 357 26 91
## Azithromycin 153 20 28 9
## Cefdinir 81 19 17 50
##
## Overall Statistics
##
## Accuracy : 0.6443
## 95% CI : (0.6338, 0.6548)
## No Information Rate : 0.8333
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1133
##
## Mcnemar's Test P-Value : <2e-16
##
## Statistics by Class:
##
## Class: Amoxicillin Class: Amoxicillin Clavulanate Class: Azithromycin Class: Cefdinir
## Sensitivity 0.7085 0.39623 0.21212 0.160256
## Specificity 0.4587 0.74289 0.97707 0.984917
## Pos Pred Value 0.8674 0.16227 0.13333 0.299401
## Neg Pred Value 0.2394 0.90731 0.98677 0.966844
## Prevalence 0.8333 0.11166 0.01636 0.038667
## Detection Rate 0.5904 0.04424 0.00347 0.006197
## Detection Prevalence 0.6806 0.27265 0.02603 0.020696
## Balanced Accuracy 0.5836 0.56956 0.59460 0.572587
# Compare accuracy and Kappa for training data
# resamps <- resamples(list(PMR = pmr.abx,
# NB = nb.abx,
# RF = rf.abx))
#
# summary(resamps)
# Accuracy and Kappa for Test data
bind_rows(c(
pmr = pmr.test$overall[c('Accuracy','Kappa')],
nb = nb.test$overall[c('Accuracy','Kappa')],
rf = rf.test$overall[c('Accuracy','Kappa')],
xg = xg.test$overall[c('Accuracy','Kappa')]
)
)## # A tibble: 1 x 8
## pmr.Accuracy pmr.Kappa nb.Accuracy nb.Kappa rf.Accuracy rf.Kappa xg.Accuracy xg.Kappa
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.410 0.0676 0.0609 0.0101 0.581 0.0710 0.644 0.113
# Prep Names for multiroc
prefix <- c('amox','aug','azi','cef')
roc_true <- sapply(prefix,function(x) paste0(x,'_true'))
roc_pred <- sapply(prefix,function(x) paste0(x,'_pred'))
roc_pmr <- sapply(roc_pred,function(x) paste0(x,'_pmr'))
roc_nb <- sapply(roc_pred,function(x) paste0(x,'_nb'))
roc_rf <- sapply(roc_pred,function(x) paste0(x,'_rf'))
roc_xg <- sapply(roc_pred,function(x) paste0(x,'_xg'))
roc_names <- c(roc_pred, roc_pmr, roc_nb, roc_rf, roc_xg)
names(roc_names) <- NULL
# roc_names
# Transform abx vector into one-hot vectors
tmp <- caret::dummyVars('~ abx_name', data=trans.test)
abx_hot <- as_tibble(predict(tmp,trans.test))
names(abx_hot) <- roc_true
# str(abx_hot)
# extract probablities for each class for each method
pmr_pred <- predict(pmr.abx,trans.test,type = 'prob')
nb_pred <- predict(nb.abx,trans.test,type = 'prob')
rf_pred <- predict(rf.abx,trans.test,type = 'prob')
xg_pred <- predict(xg.abx,trans.test,type = 'prob')
names(pmr_pred) <- roc_pmr
names(nb_pred) <- roc_nb
names(rf_pred) <- roc_rf
names(xg_pred) <- roc_xg
# combine data into format expected for multiROC
mroc_data <- cbind(abx_hot, pmr_pred, nb_pred, rf_pred, xg_pred)
mroc_out <- multiROC::multi_roc(mroc_data) #ROC data
pr_out <- multiROC::multi_pr(mroc_data) #precision-recall data
roc_res <- multiROC::plot_roc_data(mroc_out)
pr_res <- multiROC::plot_pr_data(pr_out)Here, we summarise AUC values (test data) by model and type.
# AUC Values
roc_res %>%
select(Group, Method, AUC) %>%
distinct() %>%
pivot_wider(
names_from = Method,
values_from = AUC)## # A tibble: 6 x 5
## Group pmr nb rf xg
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 amox 0.631 0.590 0.589 0.620
## 2 aug 0.583 0.535 0.584 0.606
## 3 azi 0.770 0.701 0.731 0.738
## 4 cef 0.657 0.577 0.655 0.658
## 5 Macro 0.660 0.601 0.640 0.656
## 6 Micro 0.703 0.297 0.833 0.886
# make nicer labels
roc_res <- roc_res %>%
mutate(
Group = factor(Group,levels = c('amox','aug','azi','cef','Micro','Macro'), labels=c('Amoxicillin','Amox/Clav','Azithromycin','Cefdinir','Micro','Macro')),
Method = factor(Method, levels = c('pmr','nb','rf','xg'), labels = c('Penalized Multinomial','Naive Bayes','Random Forest','eXtreme Gradient Boosting'))
)
# plot just micro and macro ROC curves
roc_res %>%
filter(Group %in% c('Micro','Macro')) %>%
ggplot() +
geom_line(aes(y = Sensitivity, x = 1-Specificity, color = Group), size = 1.5) +
geom_abline() +
facet_wrap(~ Method, ncol =2)# Macro ROC takes the average of each class ROC (is less affected by class imbalance because each class contributes equally)
# plot ROC curves for each class
roc_res %>%
filter(!Group %in% c('Micro','Macro')) %>%
ggplot() +
geom_line(aes(y = Sensitivity, x = 1-Specificity, color = Group), alpha = 0.75, size = 1.5) +
geom_abline() +
facet_wrap(~ Method,ncol = 2) +
scale_color_brewer('Antibiotic', type = 'qual', palette = 'Set1')Next, we plot precision-recall curves for these models.
# make nicer labels
pr_res <- pr_res %>%
mutate(
Group = factor(Group,levels = c('amox','aug','azi','cef','Micro','Macro'), labels=c('Amoxicillin','Amox/Clav','Azithromycin','Cefdinir','Micro','Macro')),
Method = factor(Method, levels = c('pmr','nb','rf','xg'), labels = c('Penalized Multinomial','Naive Bayes','Random Forest','eXtreme Gradient Boosting'))
)
# plot just micro and macro curves
pr_res %>%
filter(Group %in% c('Micro','Macro')) %>%
ggplot() +
geom_line(aes(y = Precision, x = Recall, color = Group), size = 1.5) +
facet_wrap(~ Method, ncol =2)# plot PR curves for each class
pr_res %>%
filter(!Group %in% c('Micro','Macro')) %>%
ggplot() +
geom_line(aes(y = Precision, x = Recall, color = Group), size = 1.5) +
facet_wrap(~ Method, ncol =2) +
scale_color_brewer('Antibiotic', type = 'qual', palette = 'Set1')Initially, models were trained on unbalanced training data. Accuracies of 80%+ were obtained, but essentially the models always predicted the majority class (amoxicillin). To address this, we took two appraoches: 1) we used sampling methods (SMOTE) to improve class balance and 2) hyper-parameter selection was based on kappa rather than accuracy. SMOTE resulted in better results than pure downsampling. Upsampling was too inefficient to tune hyper-parameters and did not result in significant-enough improvements to justify the addditional time spent training models. Overall, despite attempts at optimization, model performance remained relatively poor for antibiotic class-prediction. It would potentially be nearly as useful to predict patients which were prescribed any antibiotic other than amoxicillin (resulting in a binary classification). In this case, we might be better able to address the problem with class-imbalance. Alternatively, we could consider whether there are additional features that me helpful. Although there are many, they might include specific co-morbidities (immunodeficiency, chronic serous otitis, history of ear tubes), recent visits to the ED or primary care provider (as an indicator or previously failed treatment).
One of the more challenging aspects of this project was extracting prescribed duration from the prescription “SIG.” Fortunately, there were relatively consistent patterns that could be leveraged to obtain this data. In cases, were this could not be extracted from the SIG, we estimated the duration from the prescribed volume.
Several features (noted above) were found to be associated with adherence to the recommended guidelines. Most importantly given our intent to best target clinical decision support, was the fact that most providers used an Order Set to order antibiotics. Similarly, the association with age was important, both with respect to different guidelines by age and the fact that older patients were likely to receive shorter durations. If our change to the CDS system is not effective, we might need to evaluate the differential adherence between provider types.
While not shown here, we used the baseline evaluation to support a decision to provide age-based CDS for antibiotics in the Order Set for ear complaints from the ED. In just a few weeks, we have observed an increase in adherence to the guidelines for patients >= 2 years to 70-80%. Ongoing analyses will determine whether this change has had any unintended consequences such as driving providers away from using the Order Set or to using broader spectrum (non-amoxicillin) antibiotics.
As noted above, we attempted to predict which antibiotic a patient was likely to be prescibred. We had limited success in doing so based on currently available features (covariates). To improve these models, additional features will likely be required.